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ABSTRACT 

A central problem in artificial intelligence is that of plan- 
ning to maximize future reward under uncertainty in a par- 
tially observable environment. In this paper we propose and 
demonstrate a novel algorithm which accurately learns a 
model of such an environment directly from sequences of 
action-observation pairs. We then close the loop from ob- 
servations to actions by planning in the learned model and 
recovering a policy which is near-optimal in the original 
environment. Specifically, we present an efficient and sta- 
tistically consistent spectral algorithm for learning the pa- 
rameters of a Predictive State Representation (PSR). We 
demonstrate the algorithm by learning a model of a simu- 
lated high-dimensional, vision-based mobile robot planning 
task, and then perform approximate point-based planning 
in the learned PSR. Analysis of our results shows that the 
algorithm learns a state space which efficiently captures the 
essential features of the environment. This representation 
allows accurate prediction with a small number of parame- 
ters, and enables successful and efficient planning. 

1. INTRODUCTION 

Planning a sequence of actions or a policy to maximize fu- 
ture reward has long been considered a fundamental problem 
for autonomous agents. For many years, Partially Observ- 
able Markov Decision Processes (POMDPs) [I] [27] [4] have 
been considered the most general framework for single agent 
planning. POMDPs model the state of the world as a latent 
variable and explicitly reason about uncertainty in both ac- 
tion effects and state observability. Plans in POMDPs are 
expressed as policies, which specify the action to take given 
any possible probability distribution over state. Unfortu- 
nately, exact planning algorithms such as value iteration [27] 
are computationally intractable for most realistic POMDP 
planning problems. There are arguably two primary reasons 
for this [18]. The first is the "curse of dimensionality": for 
a POMDP with n states, the optimal policy is a function of 
ann-1 dimensional distribution over latent state. The sec- 
ond is the "curse of history": the number of distinct policies 
increases exponentially in the planning horizon. We hope to 
mitigate the curse of dimensionality by seeking a dynamical 
system model with compact dimensionality, and to mitigate 
the curse of history by looking for a model that is susceptible 
to approximate planning. 

Predictive State Representations (PSRs) 13] and the closely 
related Observable Operator Models (OOMs) [9] are gen- 
eralizations of POMDPs that have attracted interest be- 
cause they both have greater representational capacity than 



POMDPs and yield representations that are at least as com- 
pact [24| [5]. In contrast to the latent-variable representa- 
tions of POMDPs, PSRs and OOMs represent the state of a 
dynamical system by tracking occurrence probabilities of a 
set of future events (called tests or characteristic events) 
conditioned on past events (called histories or indicative 
events). Because tests and histories are observable quan- 
tities, it has been suggested that learning PSRs and OOMs 
should be easier than learning POMDPs. A final benefit 
of PSRs and OOMs is that many successful approximate 
planning techniques for POMDPs can be used to plan in 
these observable models with minimal adjustment. Accord- 
ingly, PSR and OOM models of dynamical systems have po- 
tential to overcome both the "curse of dimensionality" (by 
compactly modeling state), and the "curse of history" (by 
applying approximate planning techniques). 

The quality of an optimized policy for a POMDP, PSR, or 
OOM depends strongly on the accuracy of the model: inac- 
curate models typically lead to useless plans. We can specify 
a model manually or learn one from data, but due to the diffi- 
culty of learning, it is far more common to see planning algo- 
rithms applied to manually-specified models. Unfortunately, 
it is usually only possible to hand-specify accurate models for 
small systems where there is extensive and goal-relevant do- 
main knowledge. For example, recent extensions of approx- 
imate planning techniques for PSRs have only been applied 
to models constructed by hand [11| [8], For the most part, 
learning models for planning in partially observable environ- 
ments has been hampered by the inaccuracy of learning al- 
gorithms. For example, Expectation-Maximization (EM) [2] 
does not avoid local minima or scale to large state spaces; 
and, although many learning algorithms have been proposed 
for PSRs [25] [10] [34] [16] [30] |] and OOMs [9] [6] [lZ] that 
attempt to take advantage of the observability of the state 
representation, none have been shown to learn models that 
are accurate enough for planning. As a result, there have 
been few successful attempts at learning a model directly 
from data and then closing the loop by planning in that 
model. 

Several researchers have, however, made progress in the 
problem of planning using a learned model. In one in- 
stance [2l], researchers obtained a POMDP heuristically 
from the output of a model-free algorithm [15] and demon- 
strated planning on a small toy maze. In another instance [20] 
researchers used Markov Chain Monte Carlo (MCMC) in- 
ference both to learn a factored Dynamic Bayesian Network 
(DBN) representation of a POMDP in a small synthetic net- 
work administration domain, as well as to perform online 



planning. Due to the cost of the MCMC sampler used, this 
approach is still impractical for larger models. In a final ex- 
ample, researchers learned Linear-Linear Exponential Fam- 
ily PSRs from an agent traversing a simulated environment, 
and found a policy using a policy gradient technique with 
a parameterized function of the learned PSR staten as in- 
put 33 3l] . In this case both the learning and the planning 
algorithm were subject to local optima. In addition, the au- 
thors determined that the learned model was too inaccurate 
to support value- function-based planning methods [31] . 

The current paper differs from these and other previous 
examples of planning in learned models: it both uses a prin- 
cipled and provably statistically consistent model-learning 
algorithm, and demonstrates positive results on a challeng- 
ing high-dimensional problem with continuous observations. 
In particular, we propose a novel, consistent spectral algo- 
rithm for learning a variant of PSRs called Transformed 
PSRs [19] directly from execution traces. The algorithm 
is closely related to subspace identification for learning lin- 
ear dynamical systems (LDSs) 26, 29] and spectral algo- 
rithms for learning Hidden Markov Models (HMMs) [7] and 

. We then demon- 
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reduced-rank Hidden Markov Models 

strate that this algorithm is able to learn compact models 
of a difficult, realistic dynamical system without any prior 
domain knowledge built into the model or algorithm. Fi- 
nally, we perform point-based approximate value iteration 
in the learned compact models, and demonstrate that the 
greedy policy for the resulting value function works well in 
the original (not the learned) system. To our knowledge this 
is the first research that combines all of these achievements, 
closing the loop from observations to actions in an unknown 
domain with no human intervention beyond collecting the 
raw transition data. 

2. PREDICTIVE STATE REPRESENTATIONS 

A predictive state representation (PSR) [13] is a compact 
and complete description of a dynamical system that repre- 
sents state as a set of predictions of observable experiments 
or tests that one could perform in the system. Specifically, a 
test of length k is an ordered sequence of action-observation 
pairs t = a 1 o 1 . . . a^Ok that can be executed and observed 
at a given time. Likewise, a history is an ordered sequence 
of action-observation pairs h — a\ of . . . at Ot that has been 
executed and observed prior to a given time. The prediction 
for a test r is the probability of the sequence of observations 
01, ... ,Ok being generated, given that we intervene to take 
the sequence of actions a\ , . . . , at . If the observations pro- 
duced by the dynamical system match those specified by the 
test, then the test is said to have succeeded. The key idea 
behind a PSR is that, if the expected outcomes of execut- 
ing all possible tests are known, then everything there is to 
know about the state of a dynamical system is also known. 

In PSRs, actions in tests are interventions, not observa- 
tions. Thus it is notationally convenient to separate a test 
t into the observation component r° and the action com- 
ponent t a . In equations that contain probabilities, a single 
vertical bar | indicates conditioning and a double vertical 
bar || indicates intervening. For example, p(rf|/i| \t a ) is the 
probability of the observations in test n, conditioned on his- 
tory h, and given that we intervene to execute the actions 
in Tj. 

Formally a PSR consists of five elements {^4, 0, Q, mi, F}. 
A is the set of actions that can be executed at each time- 



step, O is the set of possible observations, and Q is a set of 
core tests. A set of core tests Q has the property that for any 
test r, there exists some function f T such that p(r°|/i||r j4 ) = 
,/V(p(Q |/j||Q A )) for all histories h. Here, the prediction 
vector 

p(Q°\h\\Q A ) = tp(9i O |ft||^ 4 ),-,P(«Sll'»ll9i0|)] T C 1 ) 

contains the probabilities of success of the tests in Q. The 
existence of f T means that knowing the probabilities for the 
tests in Q is sufficient for computing the probabilities for all 
other tests, so the prediction vector is a sufficient statistic 
for the system. The vector mi is the initial prediction for 
the outcomes of the tests in Q given some initial distribution 
over histories u. We will allow the initial distribution to be 
general; in practice u> might correspond to the steady state 
distribution for a heuristic exploration policy, or the distri- 
bution over histories when we first encounter the system, or 
the empty history with probability 1. 

In order to maintain predictions in the tests in Q we need 
to compute p(Q° \ ho\\a, Q A ), the distribution over test out- 
comes given a new extended history, from the current distri- 
bution p(Q°\h\\Q A ) (here p(Q°\ho\\a,Q A ) is the probability 
over test outcomes conditioned on history h and observation 
o given the intervention of choosing the immediate next ac- 
tion a and the appropriate actions for the test). Let f aoq be 
the function needed to update our prediction of test q 6 Q 
given an action a and an observation o. (This function is 
guaranteed to exist since we can set r = aoq in f T above.) 
Finally, F is the set of functions f aoq for all a € A, o 6 O, 
and q £ Q. 

In this work we will restrict ourselves to linear PSRs, a 
subset of PSRs where the functions f aoq are required to be 
linear in the prediction vector p(Q°\h\ \Q A ), so that 
fao q {p{Q°\h\\Q A )) = m T aoq p{Q°\h\\Q A ) for some vector 
m aoq G K'^'. 1 We write M ao to be the matrix with rows 
m Z oq - By Bayes' Rule, the update from history h, after 
taking action a and seeing observation o, is: 



p(Q°\ho\\a, 



p(o\h\\a) 
M aoP (Q°\h\\Q A ) 
ml M aoP (Q°\h\\Q A ) 



(2) 



where is a normalizing vector. Specifying a PSR in- 
volves first finding a set of core tests Q, called the discovery 
problem, and then finding the parameters M ao and m^ for 
those tests as well as an initial state mi, called the learn- 
ing problem. The discovery problem is usually solved by 
searching for linearly independent tests by repeatedly per- 
forming Singular Value Decompositions (SVDs) on collec- 
tions of tests [10| |34| . The learning problem is then solved 
by regression. 



1 Linear PSRs hav e be en shown to be a highly expressive 
class of models [9] |24| : if the set of core tests is minimal, 
then the set of PSRs with n = |Q| core tests is provably 
equivalent to the set of dynamical systems with linear di- 
mension n. The linear dimension of a dynamical system is 
a measure of its intrinsic complexity; specifically, it is the 
rank of the system- dynamics matrix 24 of the dynamical 
system. Since there exist dynamical systems of finite linear 
dimension which cannot be modeled by any POMDP (or 
HMM) with a finite number of states (see M for an exam- 
ple), POMDPs and HMMs are a proper subset of PSRs [24]. 



2.1 Transformed PSRs 

Transformed PSRs (TPSRs) p] are a generalization of 
PSRs that maintain a small number of linear combinations 
of test probabilities as sufficient statistics of the dynamical 
system. As we will see, transformed PSRs can be thought 
of as linear transformations of regular PSRs. Accordingly, 
TPSRs include PSRs as a special case since this transfor- 
mation can be the identity matrix. The main benefit of 
TPSRs is that given a set of core tests, the parameter learn- 
ing problem can be solved and a large step toward solving 
the discovery problem can be achieved in closed form. In 
this respect, TPSRs are closely related to the transformed 
representations of LDSs and HMMs found by subspace iden- 
tification [29|[26] [7|. 

For some dynamical system, let Q be the minimal set of 
core tests with cardinality n = |Q| equal to the dimension- 
ality of the linear system. Then, let T be a set of core tests 
(not necessarily minimal) and let H be a sufficient set of 
indicative events. A set of indicative events is a mutually 
exclusive and exhaustive partition of the set of all possible 
histories. We will define a sufficient set of indicative events 
below. For TPSRs, |T| and \TL\ may be arbitrarily larger 
than n; in practice we might choose T and Ti by selecting 
sets that we believe to be large enough and varied enough 
to exhibit the types of behavior that we wish to model. 

We define several matrices in terms of T and Ti. In each 
of these matrices we assume that histories H are sampled 
according to u; further actions and observations are specified 
in the individual probability expressions. Pn G R' w ' is a 
vector containing the probabilities of every h £ Ti. 

[P n ]i = Pr[# £ hi] 
= uj(H G hi) 

= 71 'hi 

=> Pn = tv (3a) 

Here we have defined two notations, Pn and 7r, for the same 
vector. Below we will generalize Pn, but keep the same 
meaning for ti. 

Next we define Pr,n £ R' r ' x ' w ', a matrix with entries 
that contain the joint probability of every test Ti G T (1 < 
i < \1~\) and every indicative event hj G Ti (1 < j < \Tt\) 
(assuming we execute test actions t a ): 

[Pr,n]ij = Pr[rf,H G hj\\r A ] 

= Pr[r°\H <Ehj\\rf] Pr[H £ hj] 

= r^Pr[Q°\H e hj \\Q A ] Pr[H G hj] 

= rj.s h Pr[Hehj] 



T,n 



rli Shj 7T 

J R5 , diag( 7 r) 



(3b) 



The vector r Ti is the linear function that specifies the prob- 
ability of the test Ti given the probabilities of core tests Q. 
The vector sjy contains the probabilities of all core tests Q 
given that the history belongs to the indicative event hj . Be- 
cause of our assumptions about the linear dimension of the 
system, the matrix Pr,n factors according to R £ R' T ' X ™ 
(a matrix with rows r£ for all 1 < i < \T\) and 5" G R nx|H| 
(a matrix with columns Shj for all 1 < j < \Ti\). Therefore, 
the rank of Pr,n is no more than the linear dimension of 
the system. At this point we can define a sufficient set of 



indicative events as promised: it is a set of indicative events 
which ensures that the rank of Pr,n is equal to the linear 
dimension of the system. Finally, mi, which we have de- 
fined as the initial prediction for the outcomes of tests in Q 
given some initial distribution over histories h, is given by 
mi — Sn (here we are taking the expectation of the columns 
of S according to the correct distribution over histories uj). 

We define Pr , ao ,n £ R' T ' X ' W ', a set of matrices, one for 
each action-observation pair, that represent the probabilities 
of a triple of an indicative event hj , the immediate following 
observation O, and a subsequent test Tj, given the appropri- 
ate actions: 



IP'. 



T,ao,n\i,j 



rf ] Pr[H G hj 



Pi[Ti,0 = o,H G hj\\A 
= Pr[rf ,0 = o\H G hj\\A 
= Pr[rf \H G hj,0 = o\\A = a,r % i 

Pr[C> = o\H G hj\\A = a] Pr[H G hj] 
= ty. Pt[Q°\H G hj,0 = o\\A = a,Q A ] 
Pr[0 = o\H G hj\\A = a] Pr[H G hj] 
= rl t M ao Pr[Q°\H G hj\\Q A ] Pr[H G hj] 
= rJ t M ao s hj Pr[H G hj] 
= rJ z M ao s h .TT hj 
Pr.ao.n = J RM ao Sdiag(7r) (3c) 



The matrices Pr,ao,n factor according to R and S (defined 



above) and the PSR transition matrix M ao G 



Note 



that R spans the column space of both Pr,n and the matri- 
ces Pr,ao,n', we make use of this fact below. 

Finally, we will use the fact that is a normalizing vec- 
tor to derive the equations below (by repeatedly multiplying 
by S and , and using the facts SS'' = / and mJ S' = 1 T , 
since each column of S is a vector of core-test predictions). 
Here, k = \Ti\ and 1^ denotes the ones-vector of length k: 



T c iT 
T ,T t 

m x = l k S' 



T - lists 



(4a) 
(4b) 

We now define a TPSR in terms of the matrices Pn, Pr,n, 
Pr,ao,n an d an additional matrix U that obeys the condition 
that U T R is invertible. In other words, the columns of U 
define an n-dimensional subspace that is not orthogonal to 
the column space of Pr,n- A natural choice for U is given 
by the left singular vectors of Pr,n- 

With these definitions, we define the parameters of a TPSR 
in terms of observable matrices and simplify the expressions 
using Equations [3ja-c), as follows (here, B ao is a similarity 
transform of the low-dimensional linear transition matrix 
M ao and feiand are the corresponding linear transforma- 
tions of the minimal PSR initial state Mi and the normal- 
izing vector): 

h ee U T Pr,nU 
= U T RSdiag(ir)l k 
= U t RSty 
= (U T R) mi 



(5a) 



bl = P£(cr T JY,w) t 

= l^S t Sdiag(7r)((7 T Pr, W ) t 

= llS t {U T R)- 1 (U T R)Sdi ag (7v)(U T P T ,n)'< 

= llS^U 1 R)' 1 ^ PtMU 1 'Pt,kY 

= llS^^R)- 1 

^ml(U T R)- 1 (5b) 

B ao = U T P T ,ao,H{U T PT,n) t 

= U J RM ao Sdmg(n){U T Pt,h) ] 

= U T RM ao (U T R)- 1 (U T R)Sding(n)(U T P T ,n) t 

= (U J R)M ao (U T R)- 1 U T PrMU T PT,H) i ' 

= ( U T R) M ao ( U T R) 1 (5c) 

The derivation of Equation [5]d makes use of Equations [4^, 
andjlfj. Given these parameters we can calculate the prob- 
ability of observations oi.t at any time t given that we inter- 
vened with actions oi : t, from the initial state mi. Here we 
write the product of each M ao (one for each action observa- 
tion pair) M aiol Ma 2 o 2 ■ ■ ■ M atot as M aoi:t . 

Pr[oi:t||oi:t] = m~l Mao 1 . A m 1 

= m T 00 (U T R)- 1 (U T R)M a01:t (U T R)- 1 (U T R)m 1 
= b T oo B a01t b 1 (6) 

In addition to the initial TPSR state b\, we define normal- 
ized conditional 'internal states' b t . We define the TPSR 
state at time t + 1 as: 



J t+i 



B aoit bi 

bJoBao 1:t bl 



(7) 



We can define a recursive state update for t > 1 as follows 
(using Equation [7] as the base case for t = 1): 



Ot+l 



B aoit bi 
i>l o B a01 . t b 1 

B a ot Baoi t — i bl 

bJ B aot B a 

°1:£— 1 1 

B aot b t 
bJa B aot b t 



(8) 



The prediction of tests p(T°\h\\T A ) at time t is given by 
Ub t = UU T Rs t = Rs t , and the rotation from a TPSR to a 
PSR is given by s t = (U J i?) _1 &t where s t is the prediction 
vector for the PSR. Note that in general, the elements of the 
linear combinations b t cannot be interpreted as probabilities 
since they may lie outside the range [0, 1]. 

3. LEARNING TPSRS 

Our learning algorithm works by building empirical esti- 
mates Pn, Pt,h, and Pr,ao,n of the matrices Pn, Pt,h, and 
PT, ao ,n defined above. To build these estimates, we repeat- 
edly sample a history h from the distribution uj, execute a 
sequence of actions, and record the resulting observations. 
This data gathering strategy implies that we must be able 
to arrange for the system to be in a state corresponding to 
h ~ uj; for example, if our system has a reset, we can take 
uj to be the distribution resulting from executing a fixed 
exploration policy for a few steps after reset. 



In practice, reset is often not available. In this case we 
can estimate Pn, Pt,h> and Pr,ao,n by dividing a single 
long sequence of action-observation pairs into subsequences 
and pretending that each subsequence started with a reset. 
We are forced to use an initial distribution over histories, 
ui, equal to the steady state distribution of the policy which 
generated the data. This approach is called the suffix-history 
algorithm 34 . With this method, the estimated matrices 



will be only approximately correct, since interventions that 
we take at one time will affect the distribution over histories 
at future times; however, the approximation is often a good 
one in practice. 

Once we have computed Pn, Pr,n, and Pr,ao,n, we can 
generate U by singular value decomposition of Pr,n- We can 
then learn the TPSR parameters by plugging U, Pn, Pr,n, 
and Pr,ao.n into Equation]^] For reference, we summarize 
the above steps here 2 : 

1. Compute empirical estimates Pn, Pt ,n,Pr ,ao,n- 

2. Use SVD on Pr,n to compute U, the matrix of left 
singular vectors corresponding to the n largest singular 
values. 

3. Compute model parameter estimates: 

(a) b 1 = U T P H , 

(b) 6^ = (Pi,nGyPn, 

(c) B ao = U T P T ,ao,n(U T PT,nV 

As we include more data in our averages, the law of large 
numbers guarantees that our estimates Pn, Pr,n, and Pr,ao,n 
converge to the true matrices Pn, Pt,h, and Pr,ao,n (de- 
fined in Equation [3| . So by continuity of the formulas in 
steps 3(a-c) above, if our system is truly a TPSR of finite 
rank, our estimates &i, boo, and B ao converge to the true 
parameters up to a linear transform. Although parameters 
estimated with finite data can sometimes lead to negative 
probability estimates when filtering or predicting, this can 
be avoided in practice by thresholding the prediction vectors 
by some small positive probability. 

Note that the learning algorithm presented here is distinct 
from the TPSR learning algorithm presented in Rosencrantz 
et al. 
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The principal difference between the two algo- 
rithms is that here we estimate the joint probability of a 
past event, a current obser vat ion, and a future event in the 
matrix Pr,ao,n whereas in [I9], the authors instead estimate 
the probability of a future event, conditioned on a past event 
and a current observation. To compensate, Rosencrantz et 
al. later multiply this estimate by an approximation of the 
probability of the current observation, conditioned on the 
past event, but not until after the SVD is applied. Rosen- 
crantz et al. also derive the approximate probability of the 
current observation differently: as the result of a regression 
instead of directly from empirical counts. Finally, Rosen- 
crantz et al. do not make any attempt to multiply by the 
marginal probability of the past event, although this term 

2 The learning strategy employed here may be seen as a gen- 
eralization of Hsu et al.'s spectral algorithm for learning 
HMMs [f to PSRs. Note that since HMMs and POMDPs 
are a proper subset of PSRs, we can use the algorithm in 
this paper to learn back both HMMs and POMDPs in PSR 
form. 



cancels in the current work so it is possible that, in the 
absence of estimation errors, both algorithms arrive at the 
same answer. 

Below we present two extensions to our learning algo- 
rithm that preserve consistency while relaxing the require- 
ment that we find a discrete set of indicative events and 
tests. These extensions make learning substantially easier 
for many difficult domains (e.g. for continuous observations) 
in practice. 

3.1 Learning TPSRs with Indicative and Char- 
acteristic Features 

In data gathered from complex real-world dynamical sys- 
tems, it may not be possible to find a reasonably-sized set 
of discrete core tests T or indicative events Ti. When this 
is the case, we can generalize the TPSR learning algorithm 
and work with features of tests and histories, which we call 
characteristic features and indicative features respectively. 
In particular let T and Tt be large sets of tests and indica- 
tive events (possibly too large to work with directly) and 
let <ji T and </> H be shorter vectors of characteristic and in- 
dicative features. The matrices Pn, Pt,h, and Pr,ao,n will 
no longer contain probabilities but rather expected values 
of features or products of features. For the special case of 
features that are indicator functions of tests and histories, 
we recover the TPSR matrices from Section [2 . 1 1 where Pn, 
Pt.h, and Pr,ao.n consist of probabilities. 

Here we prove the consistency of our estimation algorithm 
using these more general matrices as inputs. In the follow- 
ing equations $ T and $ n are matrices of characteristic and 
indicative features respectively, with first dimension equal 
to the number of characteristic or indicative features and 
second dimension equal to |T| and 1 7"^ | respectively. 

An entry of $ H is the expectation of one of the indicative 
features given the occurrence of one of the indicative events. 
An entry of $ T is the weight of one of our tests in calculating 
one of our characteristic features. With these features we 
generalize the matrices P-h, Pt,h, and Pr,ao,n'- 

[Pn], = E(^(fc)) = J2 Pt I H e 
hen 

=> Pn = (9a) 

[P T ,H\i, j =n<t>J{T°)-cff{h)\\T A ) 

tgt hen 

= EE rJa h ir h $T r Q% (by Eq. |3b}) 

tgt hen 

rer hen 



=> Pr,n = $ 7?Sdiag(7r)$ w (9b) 
[Pr,ao,nU = n<th{r°) ■ <ff{h) ■ 5(0 = o)\\t a A = a) 

=J2 J2 Pr { T °°= o - Heh \\ A = a > TA ^^ 

tgt hen 

= E rjM ao 8 h n h $TT®% (by Eq. @) 



VeT / \hen 
PT,ao,n = $ T Mf 00 5diag(7r)$ wT 



Sh^h^Vh 



where 8(0 — o) is an indicator function for a particular 
observation. The parameters of the TPSR are defined in 
terms of a matrix U that obeys the condition that U T & T R 
is invertible (we can take U to be the left singular values of 
Pr,n), and in terms of the matrices Pn, Pr.n, and Pr,ao,n- 
We also define a new vector e s.t. $ w e T = lk\ this means 
that the ones vector 1^ must be in the row space of < I >7i . 
Since $ K is a matrix of features, we can always ensure that 
this is the case by requiring one of our features to be a 
constant. Then, one row of & n is lj, and we can set e T = 
[1 ... ] T . Finally we define the generalized TPSR 
parameters b\, b^, and B ao as follows: 

bi = U T P T ,ne T 

= (7 T $ T i?5'diag(7r)$ HT e T 
= [/ T $ T RS'diag(7r)l fc 
= (U t $ t R)Stt 
= (U T $ T R) mi 
bl = P n (U T P T ,ny 



(10a) 



= l^diag(7r)$ HT ([/ T P r ,nf 
= l^S t Sdiag(7r)cE. HT (l7 T Pr i H) t 

= llS t (U T $ T R)- 1 {U T $ T R)Sdiei g {ir)$ nT (U T PT,ny 

= lIS t (t/ T $ T i?)- 1 t/ T P T ,H(t/ T PT,H) t 

= llS^^^R)- 1 

= ml (U T $ T R)- 1 (10b) 
B ao = U T P T ,aoMU T PT,ny 

= U T $ T RM ao Sding(n)$ nJ (U T PT,n) i 

= U r $ T RM ao (U T $ T Ry 1 (U T <i> T R)Sdiag(TT)$ H Xu J PT,n) t 



= {U T $ T R)M ao {U T $ T Ry 1 U T P T ,n(U T PT,- 
= (U T $ T R)M ao (U T $ T Ry 1 



(10c) 



Just as in the beginning of Section [i] we can estimate Pn, 
Pr,n, and Pr,ao,n, and then plug the matrices into Equa- 
a-c). Thus we see that if we work with characteris- 
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tions 

tic and indicative features, and if our system is truly a TPSR 
of finite rank, our estimates bi, b^, and B ao again converge 
to the true PSR parameters up to a linear transform. 

3.2 Kernel Density Estimation for Continuous 
Observations 

For continuous observations, we use Kernel Density Esti- 
mation (KDE) [23] to model the observation probability den- 
sity function (PDF). We use a fraction of the training data 
points as kernel centers, placing one multivariate Gaussian 
kernel at each point. 3 The KDE estimator of the observa- 
tion PDF is a convex combination of these kernels; since 
each kernel integrates to 1, this estimator also integrates to 
1. KDE theory [23] tells us that, with the correct kernel 
weights, as the number of kernel centers and the number 
of samples go to infinity and the kernel bandwidth goes to 



(9c) 



3 We use a general elliptical covariance matrix, chosen by 
PCA: that is, we use a spherical covariance after projecting 
onto the eigenvectors of the covariance matrix of the obser- 
vations, and scaling by the square roots of the eigenvalues. 



zero (at appropriate rates), the KDE estimator converges to 
the observation PDF in L\ norm. The kernel density esti- 
mator is completely determined by the normalized vector of 
kernel weights; therefore, if we can estimate this vector ac- 
curately, our estimate of the observation PDF will converge 
to the observation PDF as well. Hence our goal is to predict 
the correct expected value of this normalized kernel vector 
given all past observations. In the continuous-observation 
case, we can still write our latent-state update in the same 
form, using a matrix B ao ; however, rather than learning 
each of the uncountably-many B ao matrices separately, we 
learn one base operator per kernel center, and use convex 
combinations of these base operators to compute observable 
operators as needed. For more details on practical aspects 
of the learning procedure with continuous observations, see 
Section [Ol 



4. PLANNING IN TPSRS 

The primary motivation for modeling a controlled dynam- 
ical system is for reasoning about the effects of taking a se- 
quence of actions in the system. The TPSR model can be 
augmented for this purpose by specifying a reward function 
for taking an action a in state 6: 



72(6, a) = r/lb 



(11) 



where nj £ R n is the linear reward function for taking action 
a. Given this function and a discount factor 7, the planning 
problem for TPSRs is to find a policy that maximizes the ex- 
pected discounted sum of rewards E [£ t 7*72(64, a*)] . The 
optimal policy can be compactly represented using the op- 
timal value function V* , which is defined recursively as: 



V*(b) — max 



72.(6,0) + 7^p(o|M)V*(& ao ) 



06O 



(12) 



where b ao is the state obtained from b after executing action 
a and observing o. When optimized exactly, this value func- 
tion is always piecewise linear and convex (PWLC) in the 
state and has finitely many pieces in finite-horizon planning 
problems. 4 The optimal action is then obtained by taking 
the arg max instead of the max in Equation |12| 

Exact value iteration in POMDPs or TPSRs optimizes 
the value function over all possible belief or state vectors. 
Computing the exact value function is problematic because 
the number of sequences of actions that must be consid- 
ered grows exponentially with the planning horizon, called 
the "curse of history." Approximate point-based planning 
techniques (see below) attempt only to calculate the best se- 
quence of actions at some finite set of belief points. Unfortu- 
nately, in high dimensions, approximate planning techniques 
have difficulty adequately sampling the space of possible be- 
liefs. This is due to the "curse of dimensionality." Because 
TPSRs often admit a compact low-dimensional representa- 
tion, approximate point-based planning techniques can work 
well in these models. 

Point-Based Value Iteration (PBVI) |17| is an efficient 
approximation of exact value iteration that performs value 



4 This observation follows from that fact that a TPSR is a 
linear transformation of a PSR , and PSRs like POMDPs 
have PWLC value functions 111]. 



backup steps on a finite set of heuristically-chosen belief 
points rather than over the entire belief simplex. PBVI ex- 
ploits the fact that the value function is PWLC. A linear 
lower bound on the value function at one point 6 can be 
used as a lower bound at nearby points; this insight allows 
the value function to be approximated with a finite set of 
hyperplanes (often called a-vectors), one for each point. Al- 
though PBVI was designed for POMDPs, the approach has 
been generalized to PSRs 8 . Formally, given some set of 
points B = {bo, . . . , bk} in the TPSR state space, we recur- 
sively compute the value function and linear lower bounds 
at only these points. The approximation of the value func- 
tion can be represented by a set F = {ao,...,ctk} such 
that each ctj corresponds to the optimal value function at 
at least one prediction vector bi. To obtain the approximate 
value function Vt+i(6) from the previous value function Vt (b) 
we apply the recursive backup operator on points in B: if 
Vt(b) = max a6 r t a J b, then 



Vt+i(b) = max 



72(6, a) + 7 > max a 1 B ao b 



(13) 



In addition to being tractable on much larger-scale plan- 
ning problems than exact value iteration, PBVI comes with 
theoretical guarantees in the form of error bounds that are 
low-order polynomials in the degree of approximation, range 
of reward values, and discount factor 7 [17] [8]. Perseus [28| 
|11| is a variant of PBVI that updates the value function over 
a small randomized subset of a large set of reachable belief 
points at each time step. By only updating a subset of belief 
points, Perseus can achieve a computational advantage over 
plain PBVI in some domains. We use Perseus in this paper 
due to its speed and simplicity of implementation. 

5. EXPERIMENTAL RESULTS 

We have introduced a novel algorithm for learning TPSRs 
directly from data, as well as a kernel-based extension for 
modeling continuous observations, and discussed how to plan 
in the learned model. First we demonstrate the viability of 
this approach to planning in a challenging non-linear, par- 
tially observable, controlled domain by learning a model di- 
rectly from sensor inputs and then "closing the loop" by plan- 
ning in the learned model. Second, unlike previous attempts 
to learn PSRs, which either lack planning results |19[|32] , or 
which compare policies within the learned system~" [33| , we 
compare our resulting policy to a bound on the best possi- 
ble solution in the original system and demonstrate that the 
policy is close to optimal. 

5.1 The Autonomous Robot Domain 

The simulated autonomous robot domain consists of a 
simple 45 x 45 unit square arena with a central obstacle 
and brightly colored walls (Figure[ljA-B)). We modeled the 
robot as a sphere of radius 2 units. The robot can move 
around the floor of the arena, and rotate to face in any direc- 
tion. The robot has a simulated 16 x 16 pixel color camera, 
whose focal plane is located one unit in front of the robot's 
center of rotation. The robot's visual field was 45° in both 
azimuth and elevation, thus providing the robot with an an- 
gular resolution of ~ 2.8° per pixel. Images on the sensor 
matrix at any moment were simulated by a non-linear per- 
spective transformation of the projected values arising from 
the robot's position and orientation in the environment at 




Figure 1: Learning the Autonomous Robot Domain. (A) The robot uses visual sensing to traverse a square 
domain with multi-colored walls and a central obstacle. Examples of images recorded by the robot occupying 
two different positions in the environment are shown on the at the bottom of the figure. (B) A to-scale 
3-dimensional view of the environment. (C) The 2nd and 3rd dimension of the learned subspace (the first 
dimension primarily contained normalization information). Each point is the embedding of a single history, 
displayed with color equal to the average RGB color in the first image in the highest probability test. (D) 
The same points in (C) projected onto the environment's geometric space. 



that time. The resulting 768-element pattern of unprocessed 
RGB values was the only input to an robot (images were not 
preprocessed to extract features), and each action produced 
a new set of pixel values. The robot was able to move for- 
ward 1 or units, and simultaneously rotate 15°, —15°, or 
0°, resulting in 6 unique actions. In the real world, friction, 
uneven surfaces, and other factors confound precisely pre- 
dictable movements. To simulate this uncertainty, a small 
amount of Gaussian noise was added to the translation and 
rotation components of the actions. The robot was allowed 
to occupy any real- valued (x,y,8) pose in the environment, 
but was not allowed to intersect walls. In case of a collision, 
we interrupted the current motion just before the robot in- 
tersected an obstacle, simulating an inelastic collision. 

5.2 Learning a Model 

We learn our model from a sample of 10000 short tra- 
jectories, each containing 7 action-observation pairs. We 
generate each trajectory by starting from a uniformly ran- 
domly sampled position in the environment and executing 
a uniform random sequence of actions. We used the first 
I = 2000 trajectories to generate kernel centers, and the re- 
maining w — 8000 to estimate the matrices Pn, Pt,h, and 

Pt,o.o,TI- 

To define these matrices, we need to specify a set of in- 
dicative features, a set of observation kernel centers, and a 
set of characteristic features. We use Gaussian kernels to 
define our indicative and characteristic features, in a similar 
manner to the Gaussian kernels described above for observa- 
tions; our analysis allows us to use arbitrary indicative and 
characteristic features, but we found Gaussian kernels to be 
convenient and effective. Note that the resulting features 
over tests and histories are just features; unlike the kernel 
centers defined over observations, there is no need to let the 
kernel width approach zero, since we are not attempting to 
learn accurate PDFs over the histories and tests in TC and 
T. 

In more detail, we define a set of 2000 indicative kernels, 
each one centered at a sequence of 3 observations from the 



initial segment of one of our trajectories. We choose the 
kernel covariance using PCA on these sequences of observa- 
tions, just as described for single observations in Section [3.2| 
We then generate our indicative features for a new sequence 
of three observations by evaluating each indicative kernel at 
the new sequence, and normalizing so that the vector of fea- 
tures sums to one. Similarly, we define 2000 characteristic 
kernels, each one centered at a sequence of 3 observations 
from the end of one of our sample trajectories, choose a 
kernel covariance, and define our characteristic feature vec- 
tor by evaluating each kernel at a new observation sequence 
and normalizing. The initial distribution lo is, therefore, the 
distribution obtained by initializing uniformly and taking 3 
random actions. Finally, we define 500 observation kernels, 
each one centered at a single observation from the middle of 
one of our sample trajectories, and replace each observation 
by its corresponding vector of normalized kernel weights. 

Next, we construct the matrices Pn, Pt,h, and Pr,ao,n- 
As defined above, each element of Pn is the empirical ex- 
pectation (over our 8,000 training trajectories) of the cor- 
responding element of the indicative feature vector — that 
is, element i is ^ Yst=i 0*t j where 'Pit is the ith indicative 
feature, evaluated at the current history at time t. Simi- 
larly, each element of Pt,h is the empirical expectation of 
the product of one indicative feature and one characteris- 
tic feature: element i,j is h'Y^t=x < t > it < P%- Once we have 
constructed Pr,n, we can compute U as the matrix of left 
singular vectors of Pr.n- One of the advantages of subspace 
identification is that the complexity of the model can be 
tuned by selecting the number of singular vectors in U. To 
learn an exact TPSR, we should pick the first n singular 
vectors that correspond to singular values in Pr,n greater 
than some cutoff that varies with the noise resolution of our 
data. However, we may wish to pick a smaller set of sin- 
gular vectors; doing so will produce a more compact TPSR 
at the possible loss of prediction quality. We chose n — 5, 
the smallest TPSR that was able to produce high quality 
policies (see Section 5.4 below). 




Figure 2: Planning in the Learned State Space. (A) The value function computed for each embedded point; 
lighter indicates higher value. (B) Policies executed in the learned subspace. The red, green, magenta, and 
yellow paths correspond to the policy executed by a robot with starting positions facing the red, green, 
magenta, and yellow walls respectively. (C) The paths taken by the robot in geometric space while executing 
the policy. Each of the paths corresponds to the path of the same color in (B). The darker circles indicate the 
starting and ending position of each path, and the tick-mark in the circles indicates the robot's orientation. 
(D) Mean number of actions in path from 100 randomly sampled start position to the target image (facing 
blue wall). The first bar (left) is the mean number of actions in the optimal solution found by A* search in the 
robot's configuration space. The second bar (center) is the mean number of actions taken by executing the 
policy computed by Perseus in the learned model (the asterisk indicates that this mean was only computed 
over the 78 successful paths). The last bar (right) is the mean number of actions required to find the target 
with a random policy. The graph indicates that the policy computed from the learned TPSR is close to 
optimal. 



Finally, rather than computing Pr,ao,H directly, we in- 
stead compute U T Pt ,ao,H for each pair a, o: the latter ma- 
trices are much smaller, and in our experiments, we saved 
substantially on both memory and runtime by avoiding con- 
struction of the larger matrices. To construct U 1 Pr,ao,H, we 
restrict to those training trajectories in which the action at 
the middle time step (i.e., step 4) is a. Then, each element of 
Pr,ao,n is the empirical expectation (among the restricted 
set of trajectories) of the product of one indicative feature, 
one characteristic feature, and element o of the observation 
kernel vector. So, 

1 Wa 1 

U T P T ,ao,H = — V(£/ T 0f)(^) T ^(°t ~ 0) (14) 

where K(.) is the kernel function and Zt is the kernel nor- 
malization constant computed by summing over the 500 ob- 
servation kernels for each ot- Given the matrices Ph, Pt,h, 
and Pr,ao,n, we can compute the TPSR parameters using 
the equations in Section [3] 

5.3 Qualitative Evaluation 

Having learned the parameters of the TPSR, the model 
can be used for prediction, filtering, and planning in the 
autonomous robot domain. We first evaluated the model 
qualitatively by projecting the sets of histories in the train- 
ing data onto the learned TPSR state space: U Ph ■ We 
colored each datapoint according to the average of the red, 
green, and blue components of the highest probability obser- 
vation following the projected history. The features of the 
low dimensional embedding clearly capture the topology of 



the major features of the robot's visual environment (Figure 
[TJC-D)), and continuous paths in the environment translate 
into continuous paths in the latent space (Figure [2|B)). 

5.4 Planning in the Learned Model 

To test the quality of the learned model, we set up a nav- 
igation problem where the robot was required to plan a set 
of actions in order to reach a goal image (looking directly at 
the blue wall). We specified a large reward (1000) for this 
observation, a reward of —1 for colliding with a wall, and 
for every other observation. We next learned a reward 
function by linear regression from the histories embedded 
in the learned TPSR state space to the reward specified at 
each image that followed an embedded history. We used the 
reward function to compute an approximate value function 
using the Perseus algorithm with discount factor 7 = .8, a 
prediction horizon of 10 steps, and with the 8000 embedded 
histories as the set of belief points. The learned value func- 
tion is displayed in Figure A). Once the approximate value 
function has been learned, and an initial belief specified, the 
robot greedily chooses the action which maximizes the ex- 
pected value. The initial beliefs were computed by starting 
with 61 and then incorporating 3 random action-observation 
pairs. Examples of paths planned in the learned model are 
presented in Figure [5|B); the same paths are shown in geo- 
metric space (recall that the robot only has access to images; 
the geometric space is never observed by the robot) in Fig- 
ure[2jC). Note that there are a set of valid target positions in 
the environment since one can receive an identical close-up 
image of a blue wall from anywhere along the corresponding 
edge of the environment. 



The reward function encouraged the robot to navigate to 
a specific set of points in the environment, therefore the 
planning problem can be viewed as solving a shortest path 
problem. Even though we don't encode this intuition into 
our algorithm, we can use it to quantitatively evaluate the 
performance of the policy in the original system. First we 
randomly sampled 100 initial histories in the environment 
and asked the robot to plan a path based on its learned pol- 
icy. The robot was able to reach the goal in 78 of the trials. 
In 22 trials, the robot got stuck repeatedly taking alternat- 
ing actions whose effects cancelled (for example, alternating 
between turning —15° and 15°). 5 When the robot was able 
to reach the goal, we compared the number of actions taken 
both to the minimal path, calculated by A* search in the 
robot's configuration space given the true underlying posi- 
tion, and to a random policy. Note that comparison to the 
optimal policy is somewhat unfair: in order to recover the 
optimal policy the robot would have to know its true under- 
lying position (which is not available to it), our model as- 
sumptions would have to be exact, and the algorithm would 
need an unlimited amount of training data. The results, 
summarized in Figure [2|D), indicate that the TPSR policy 
is close to the optimal policy in the original system. We 
think that this result is remarkable, especially given that 
previous approaches have encountered significant difficulty 
modeling continuous domains [12] and domains with simi- 
larly high levels of complexity |33| . 

6. CONCLUSIONS 

We have presented a novel consistent subspace identifi- 
cation algorithm that simultaneously solves the discovery 
and learning problems for TPSRs. In addition, we provided 
two extensions to the learning algorithm that are useful in 
practice, while maintaining consistency: characteristic and 
indicative features only require one to know relevant fea- 
tures of tests and histories, rather than sets of core tests 
and histories, while kernel density estimation can be used to 
find observable operators when observations are real- valued. 
We also showed how point-based approximate planning tech- 
niques can be used to solve the planning problem in the 
learned model. We demonstrated the representational ca- 
pacity of our model and the effectiveness of our learning 
algorithm by learning a very compact model from simulated 
autonomous robot vision data. We closed the loop by suc- 
cessfully planning with the learned models, using Perseus to 
approximately compute the value function and optimal pol- 
icy for a navigation task. To our knowledge this is the first 
instance of learning a model for a simulated robot in a par- 
tially observable environment using a consistent algorithm 
and successfully planning in the learned model. We com- 
pare the policy generated by our model to a bound on the 
best possible value, and determine that our policy is close 
to optimal. 

We believe the spectral PSR learning algorithm presented 
here, and subspace identification procedures for learning 
PSRs in general, can increase the scope of planning under 
uncertainty for autonomous agents in previously intractable 
scenarios. We believe that this improvement is partly due to 

5 In an actual application, we believe that we could avoid 
getting stuck by performing a short lookahead or simply by 
randomizing our policy; for purposes of comparison, how- 
ever, we report results for the greedy policy. 



the greater representational power of the PSR as compared 
to POMDPs and partly due to the efficient and statistically 
consistent nature of the learning method. 
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